
doc: add example of RowFilter usage #9115

Merged
alamb merged 1 commit into apache:main from sonhmai:doc/row-filter-usage-9096 on Jan 13, 2026

Conversation


@sonhmai sonhmai commented Jan 8, 2026

Which issue does this PR close?

Closes #9096.

Rationale for this change

The RowFilter API exists and can evaluate predicates while the reader decodes data, but it has no usage examples.

What changes are included in this PR?

  • Added a rustdoc example and blog link to ParquetRecordBatchReaderBuilder::with_row_filter.
  • Added a running example in parquet/examples/read_with_row_filter.rs

Are these changes tested?

Yes

cargo run -p parquet --example read_with_row_filter
cargo test -p parquet --doc

Are there any user-facing changes?

Yes, doc only. No API changes.
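
For readers new to the idea behind `RowFilter`, here is a minimal std-only sketch of filtering during decoding: the predicate runs on one column first, and the other columns are materialized only for rows that passed. It does not use the parquet crate. `Column`, `evaluate_predicate`, `materialize_selected`, and the sample data are hypothetical names and values chosen for illustration.

```rust
/// Toy stand-in for one decoded column. This is NOT a parquet crate type.
struct Column {
    values: Vec<i32>,
}

/// Phase 1: run the predicate over the filter column only,
/// producing one boolean per row.
fn evaluate_predicate(filter_col: &Column, predicate: impl Fn(i32) -> bool) -> Vec<bool> {
    filter_col.values.iter().map(|&v| predicate(v)).collect()
}

/// Phase 2: materialize another column only at the rows the mask kept.
fn materialize_selected(col: &Column, mask: &[bool]) -> Vec<i32> {
    col.values
        .iter()
        .zip(mask)
        .filter_map(|(&v, &keep)| keep.then_some(v))
        .collect()
}

fn main() {
    // Hypothetical data: `id` drives the filter, `payload` is only needed for matches.
    let id = Column { values: vec![1, 5, 3, 8] };
    let payload = Column { values: vec![10, 50, 30, 80] };

    let mask = evaluate_predicate(&id, |v| v > 4);
    let selected = materialize_selected(&payload, &mask);
    println!("mask = {mask:?}, selected payload = {selected:?}");
}
```

In the API this PR documents, the role of the closure is played by an `ArrowPredicateFn` wrapped in a `RowFilter` and handed to `with_row_filter`. The toy version only mirrors the order of operations, not the reader's actual implementation.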

@github-actions bot added the parquet (Changes to the parquet crate) label on Jan 8, 2026
@sonhmai force-pushed the doc/row-filter-usage-9096 branch from 37be4e1 to f286dfd on January 8, 2026 06:59
@sonhmai changed the title from "doc: add example of RowFilter usage" to "draft: doc: add example of RowFilter usage" on Jan 8, 2026
@sonhmai force-pushed the doc/row-filter-usage-9096 branch from f286dfd to bc8e06f on January 8, 2026 07:32
@sonhmai changed the title from "draft: doc: add example of RowFilter usage" to "doc: add example of RowFilter usage" on Jan 8, 2026

sonhmai commented Jan 8, 2026

@alamb would you mind reviewing this? Thanks!

use parquet::errors::Result;
use std::fs::File;

// RowFilter / with_row_filter usage. For background and more

Are we better off removing this and keeping only the doctest to reduce duplication?


Yes, I agree -- I think the doc examples are easier to find so I recommend removing this example file

Actually, looking at the existing examples I think many of them are redundant / would be easier to find if we moved them into the documentation:
https://github.com/apache/arrow-rs/tree/main/parquet/examples


/// more efficient skipping over data pages. See [`ArrowReaderOptions::with_page_index`].
///
/// For a running example see `parquet/examples/read_with_row_filter.rs`.
/// See <https://arrow.apache.org/blog/2025/12/11/parquet-late-materialization-deep-dive/>

/// See the [blog post on late materialization] for a more technical explanation.
///
/// ...
///
/// [blog post on late materialization]: https://arrow.apache.org/blog/2025/12/11/parquet-late-materialization-deep-dive/

Slightly nicer formatting this way.


+1


@alamb alamb left a comment


Thank you @sonhmai and @Jefffrey -- this is great work and a nice addition.

I think @Jefffrey's suggestions and mine would make this PR better, but I also think we could merge it as is and iterate in a follow-on PR. Just let us know what you would like to do, @sonhmai.

/// let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;
/// let schema_desc = builder.metadata().file_metadata().schema_descr_ptr();
///
/// // Create predicate: column id > 4. This col has index 0.

Suggested change
- /// // Create predicate: column id > 4. This col has index 0.
+ /// // Create predicate that evaluates `id > 4`. The `id` column has index 0.

/// // Create predicate: column id > 4. This col has index 0.
/// let projection = ProjectionMask::leaves(&schema_desc, [0]);
/// let predicate = ArrowPredicateFn::new(projection, |batch| {
///     let id_col = batch.column(0);

As a minor suggestion, I think it would make a nicer example if you picked a different column from the file other than 0 so that it is clear the batch passed to the predicate only contains the selected projection column

For example, perhaps you could use the int_col (column index 4)

> select * from './parquet-testing/data/alltypes_plain.parquet';
+----+----------+-------------+--------------+---------+------------+-----------+------------+------------------+------------+---------------------+
| id | bool_col | tinyint_col | smallint_col | int_col | bigint_col | float_col | double_col | date_string_col  | string_col | timestamp_col       |
+----+----------+-------------+--------------+---------+------------+-----------+------------+------------------+------------+---------------------+
| 4  | true     | 0           | 0            | 0       | 0          | 0.0       | 0.0        | 30332f30312f3039 | 30         | 2009-03-01T00:00:00 |
| 5  | false    | 1           | 1            | 1       | 10         | 1.1       | 10.1       | 30332f30312f3039 | 31         | 2009-03-01T00:01:00 |
| 6  | true     | 0           | 0            | 0       | 0          | 0.0       | 0.0        | 30342f30312f3039 | 30         | 2009-04-01T00:00:00 |
| 7  | false    | 1           | 1            | 1       | 10         | 1.1       | 10.1       | 30342f30312f3039 | 31         | 2009-04-01T00:01:00 |
| 2  | true     | 0           | 0            | 0       | 0          | 0.0       | 0.0        | 30322f30312f3039 | 30         | 2009-02-01T00:00:00 |
| 3  | false    | 1           | 1            | 1       | 10         | 1.1       | 10.1       | 30322f30312f3039 | 31         | 2009-02-01T00:01:00 |
| 0  | true     | 0           | 0            | 0       | 0          | 0.0       | 0.0        | 30312f30312f3039 | 30         | 2009-01-01T00:00:00 |
| 1  | false    | 1           | 1            | 1       | 10         | 1.1       | 10.1       | 30312f30312f3039 | 31         | 2009-01-01T00:01:00 |
+----+----------+-------------+--------------+---------+------------+-----------+------------+------------------+------------+---------------------+
8 row(s) fetched.
Elapsed 0.039 seconds.

> describe './parquet-testing/data/alltypes_plain.parquet';
+-----------------+---------------+-------------+
| column_name     | data_type     | is_nullable |
+-----------------+---------------+-------------+
| id              | Int32         | YES         |
| bool_col        | Boolean       | YES         |
| tinyint_col     | Int32         | YES         |
| smallint_col    | Int32         | YES         |
| int_col         | Int32         | YES         |
| bigint_col      | Int64         | YES         |
| float_col       | Float32       | YES         |
| double_col      | Float64       | YES         |
| date_string_col | BinaryView    | YES         |
| string_col      | BinaryView    | YES         |
| timestamp_col   | Timestamp(ns) | YES         |
+-----------------+---------------+-------------+
11 row(s) fetched.
Elapsed 0.005 seconds.
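
The point about projection indices can be made concrete with a small std-only sketch. It models a table as named `i32` columns and shows that, after projecting file column 4 (`int_col`), the predicate sees that column at index 0 of its batch. `Table`, `project`, `alltypes_plain_prefix`, and `int_col_is_positive` are hypothetical helpers, not parquet crate APIs, and the predicate `int_col > 0` is an illustrative choice rather than the one used in the PR. The values are copied from the first five columns of the listing above, with `bool_col` encoded as 0/1.

```rust
/// Toy table of (column name, values). These are NOT parquet crate types.
type Table = Vec<(&'static str, Vec<i32>)>;

/// First five columns of alltypes_plain.parquet, in file order.
fn alltypes_plain_prefix() -> Table {
    vec![
        ("id", vec![4, 5, 6, 7, 2, 3, 0, 1]),
        ("bool_col", vec![1, 0, 1, 0, 1, 0, 1, 0]), // true/false encoded as 1/0
        ("tinyint_col", vec![0, 1, 0, 1, 0, 1, 0, 1]),
        ("smallint_col", vec![0, 1, 0, 1, 0, 1, 0, 1]),
        ("int_col", vec![0, 1, 0, 1, 0, 1, 0, 1]),
    ]
}

/// Keep only the columns whose file index is listed, mimicking a projection mask.
fn project(table: &Table, leaf_indices: &[usize]) -> Table {
    table
        .iter()
        .enumerate()
        .filter(|(i, _)| leaf_indices.contains(i))
        .map(|(_, col)| col.clone())
        .collect()
}

/// Predicate over the projected batch: true where `int_col > 0`.
fn int_col_is_positive(batch: &Table) -> Vec<bool> {
    // Index 0 is the first *projected* column, not file column 0.
    let (_, values) = &batch[0];
    values.iter().map(|&v| v > 0).collect()
}

fn main() {
    let table = alltypes_plain_prefix();
    let batch = project(&table, &[4]);
    println!("predicate batch columns: {:?}", batch.iter().map(|c| c.0).collect::<Vec<_>>());

    let mask = int_col_is_positive(&batch);
    // Apply the mask to `id` to see which rows survive the filter.
    let ids: Vec<i32> = table[0]
        .1
        .iter()
        .zip(&mask)
        .filter_map(|(&id, &keep)| keep.then_some(id))
        .collect();
    println!("ids where int_col > 0: {ids:?}");
}
```

Filtering on a column other than the first one makes the index shift visible: the file index passed to the projection is 4, while the index used inside the predicate is 0.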

@alamb alamb merged commit 4ddaa8c into apache:main Jan 13, 2026
18 checks passed

alamb commented Jan 13, 2026

I made a follow-on PR with small improvements.

alamb added a commit that referenced this pull request Jan 14, 2026
# Which issue does this PR close?

- part of #9096
- Follow on to #9115

# Rationale for this change

@sonhmai started us off with
#9115

@Jefffrey and I had some suggestions on the PR and I found some more
while going through it again, so I figured I would make a new PR

# What changes are included in this PR?

1. Improve the documentation
2. Improve the doc comment example
3. Remove redundant example in parquet/examples/read_with_row_filter.rs

# Are these changes tested?

By CI

# Are there any user-facing changes?


---------

Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
@sonhmai sonhmai deleted the doc/row-filter-usage-9096 branch January 15, 2026 05:43
Dandandan pushed a commit to Dandandan/arrow-rs that referenced this pull request Jan 15, 2026
# Which issue does this PR close?

- Closes apache#9096.

Dandandan pushed a commit to Dandandan/arrow-rs that referenced this pull request Jan 15, 2026
Development

Successfully merging this pull request may close these issues.

Document / Add an example of RowFilter usage

3 participants